Creates a new way to test multiple selections for each protein; parallelizes LDDT script #116
marcuscollins merged 6 commits into main
Conversation
Merge conflicts resolved in:
- src/sampleworks/core/rewards/real_space_density.py
- src/sampleworks/utils/atom_array_utils.py
Parallelizing LDDT script with joblib. Closes #98
Caution: Review failed. The pull request is closed.

ℹ️ Recent review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
📝 Walkthrough

Parallelizes LDDT evaluation by splitting per-protein selections into independent jobs, expands reference loading and selection handling to be per-selection, updates dataclasses and selection utilities to support multiple selections, and aggregates per-selection results into a unified dataframe.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI / Main Script
    participant Filter as Experiment Filter
    participant RefLoader as Reference Loader
    participant Parallel as joblib.Parallel
    participant Worker as process_exp_with_selection
    participant Aggregator as Result Aggregator
    CLI->>Filter: load experiments & apply case-insensitive protein matching
    Filter->>RefLoader: request reference coords per (protein, occ, selection)
    RefLoader-->>Filter: return nested dict [(protein,occ)][selection]->coords
    CLI->>Parallel: dispatch (exp, selection) tasks
    Parallel->>Worker: for each (exp,selection) -> load predicted structure
    Worker->>Worker: apply selection mask / clustering / compute LDDT
    Worker-->>Aggregator: return per-selection result dict (metrics or error)
    Aggregator->>Aggregator: combine results + null_results -> dataframe
    Aggregator-->>CLI: write CSV / return dataframe
```
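The walkthrough and diagram describe one independent job per (experiment, selection) pair. The script itself uses joblib.Parallel; the same fan-out shape can be sketched with the standard library (the worker name and result fields below are placeholders, not the script's actual API):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def process_exp_with_selection(exp: str, selection: str) -> dict:
    # Placeholder worker: in the real script this loads the predicted
    # structure, applies the selection mask, and computes LDDT.
    return {"exp": exp, "selection": selection, "lddt": 0.0}

experiments = ["exp1", "exp2"]
selections = ["name CA", "resi 10-20"]

# Dispatch one task per (experiment, selection) pair, then aggregate.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(
        pool.map(lambda pair: process_exp_with_selection(*pair),
                 product(experiments, selections))
    )

assert len(results) == len(experiments) * len(selections)
```

Each result dict stands alone, so failed pairs can carry an error field without blocking the rest of the aggregation.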
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🧹 Nitpick comments (2)
scripts/eval/lddt_evaluation_script.py (1)
223-241: Load each reference structure once per occupancy, then fan out by selection.

The current product loop reloads and reparses the same reference structure for every selection, which is avoidable overhead.
⚡ Proposed refactor
```diff
-    for protein_key, protein_config in protein_configs.items():
-        for occ, sel in itertools.product(args.occupancies, protein_config.selection):
-            ref_path, reference_proteins = get_reference_atomarraystack(protein_config, occ)
+    for protein_key, protein_config in protein_configs.items():
+        for occ in args.occupancies:
+            ref_path, reference_proteins = get_reference_atomarraystack(protein_config, occ)
             if reference_proteins is None:
                 logger.warning(
                     f"Could not find ref structure for {protein_key} and occupancy {occ}"
                 )
                 continue
+            if (protein_key, occ) not in reference_atom_arrays:
+                reference_atom_arrays[(protein_key, occ)] = {}
+
+            for sel in protein_config.selection:
+                try:
+                    reference_protein_stack, _, _ = map_altlocs_to_stack(
+                        reference_proteins,
+                        selection=translate_selection(sel),
+                        return_full_array=True,
+                    )
+                    reference_atom_arrays[(protein_key, occ)][sel] = reference_protein_stack
+                except Exception as e:
+                    logger.error(
+                        f"Error loading ref structure for {protein_key} and occupancy {occ}: {e}"
+                    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/lddt_evaluation_script.py` around lines 223-241: the loop currently calls get_reference_atomarraystack and reparses the same reference for every selection, because itertools.product over args.occupancies and protein_config.selection repeats loads. Change the logic to iterate over occupancies first (for occ in args.occupancies), call get_reference_atomarraystack(protein_config, occ) once, skip if None, then inside that occupancy iterate over selections (for sel in protein_config.selection), call map_altlocs_to_stack(reference_proteins, selection=translate_selection(sel), return_full_array=True), and store into reference_atom_arrays[(protein_key, occ)][sel]. Ensure the existing caching check for (protein_key, occ) remains so each reference is only stored once per occupancy.

scripts/eval/rscc_grid_search_script.py (1)
127-165: Avoid recomputing the refined density for every selection.

Parsing the CIF and recomputing density inside the selection loop scales linearly with the number of selections and can dominate runtime. Compute the predicted density once per experiment, then only run per-selection extraction/RSCC.
⚡ Refactor sketch
```diff
+        # Load refined structure and compute predicted density once per experiment
+        _structure = parse(_exp.refined_cif_path, ccd_mirror_path=None)
+        atom_array = get_asym_unit_from_structure(_structure)
+        if not hasattr(atom_array, "coord") or atom_array.coord is None:
+            raise AttributeError("AtomArray | AtomArrayStack is missing coordinates")
+        if not hasattr(atom_array, "b_factor"):
+            atom_array.set_annotation("b_factor", np.full(atom_array.coord.shape[-2], 20.0))
+
         for selection in protein_config.selection:
             ...
-            _structure = parse(_exp.refined_cif_path, ccd_mirror_path=None)
-            atom_array = get_asym_unit_from_structure(_structure)
-            ...
             _computed_density, _ = compute_density_from_atomarray(
                 atom_array, xmap=_base_xmap, em_mode=False, device=_device
             )
```

You can also split cache keys so full base maps are cached by `(protein, occ_a)` and extracted submaps by `(protein, occ_a, selection)`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/rscc_grid_search_script.py` around lines 127 - 165, The code recomputes parsing and computed density inside the per-selection loop (parse, get_asym_unit_from_structure, compute_density_from_atomarray), causing O(selections) work; move CIF parsing and density computation out of the selection loop and cache the results per experiment/occ_a. Specifically: parse the refined CIF via parse(...) once for each _exp and produce atom_array/get_asym_unit_from_structure once, then call compute_density_from_atomarray(...) once to produce a cached _computed_density (keyed by (protein, _exp.occ_a) or a separate computed_density_cache). Inside the selection loop keep only loading/_base_xmap extraction and per-selection _extracted_base handling, and reuse the cached _computed_density instead of calling parse or compute_density_from_atomarray again.
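The cache-key split suggested above can be sketched as a two-level memo. The function and key names here are illustrative stand-ins, not the script's actual identifiers:

```python
# Hypothetical two-level cache: full base maps keyed by (protein, occ_a),
# extracted submaps keyed additionally by the selection string.
base_map_cache: dict = {}
submap_cache: dict = {}

def get_submap(protein, occ_a, selection, load_base, extract):
    base_key = (protein, occ_a)
    if base_key not in base_map_cache:
        # Expensive load happens once per (protein, occ_a), not per selection.
        base_map_cache[base_key] = load_base(protein, occ_a)
    sub_key = (protein, occ_a, selection)
    if sub_key not in submap_cache:
        submap_cache[sub_key] = extract(base_map_cache[base_key], selection)
    return submap_cache[sub_key]
```

With this shape, adding a new selection to a protein config costs one extraction, not a full reload and density recomputation.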
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/eval/rscc_grid_search_script.py`:
- Around line 180-193: The code mutates the shared Experiment object _exp across
selections causing state leakage; fix by working on a fresh shallow/deep copy
before any mutation — e.g., create a copy of _exp (use copy.deepcopy or convert
to dict first) and perform assignments to that copy (set rscc, base_map_path,
error) and then append the copy to results; ensure you stop mutating the
original _exp (referenced symbols: _exp, exp_dict_copy, selection, results,
rscc) so each selection row is independent.
In `@src/sampleworks/eval/eval_dataclasses.py`:
- Line 193: The CSV parsing currently uses
selection=row["selection"].strip().split(";") which can produce empty or
whitespace-only entries; update the parsing logic where the selection field is
built so you split on ";" then trim each item and filter out empty strings
(e.g., split, map strip, filter out falsy items) and then validate the resulting
list (e.g., ensure it's non-empty or that each item matches allowed choices)
before assigning to the selection field; adjust any error handling to raise a
clear exception or log invalid selections so downstream code that consumes
selection receives a normalized, validated list.
In `@src/sampleworks/eval/structure_utils.py`:
- Around line 369-373: The selection parsing in extract_selection_coordinates()
is inconsistent with apply_selection() and uses atom_array to compute the mask
(fragile for stacked inputs); update the condition to check for ("==", ">", "<",
"<=", ">=", " in ") and, in the else branch, compute mask from working_array
(use working_array.mask(selection) instead of atom_array.mask(selection)); keep
get_mask_from_old_selection_string(...) call for legacy parsing as-is and ensure
variable name mask is assigned from working_array when using the modern
selection path.
In `@src/sampleworks/utils/atom_array_utils.py`:
- Around line 295-303: After building altloc_ids and altloc_list via
find_all_altloc_ids and select_altloc, guard against cases where altloc_list has
fewer than two entries: if altloc_list is empty return an empty result (or the
function's appropriate zero-state), and if it contains exactly one element skip
calling filter_to_common_atoms and use that single selection as atom_arrays
(respecting return_full_array behavior); implement this check just before the
call to filter_to_common_atoms to avoid the downstream generic error.
- Around line 195-203: In find_all_altloc_ids(), hasattr(atom_array,
altloc_label) can be True even when the attribute value is None; change the
guard to fetch the attribute with getattr(atom_array, altloc_label, None),
validate that the returned altloc_ids is not None (and is the expected
array-like), and if it is None raise the same AttributeError (or a clear
ValueError) mentioning altloc_label and that the annotation is missing/None;
then proceed to call np.unique on altloc_ids and return set(altloc_ids.tolist())
- BLANK_ALTLOC_IDS.
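Two of the prompts above amount to small defensive guards. A hypothetical sketch of both (the BLANK_ALTLOC_IDS sentinel set and the annotation name are assumptions, not the library's actual values):

```python
import numpy as np

BLANK_ALTLOC_IDS = {"", ".", "?"}  # assumed blank-sentinel set

def parse_selection_field(raw: str) -> list[str]:
    """Split a ';'-delimited selection field, trimming whitespace and
    dropping empty entries; reject an effectively empty field."""
    items = [s.strip() for s in raw.split(";") if s.strip()]
    if not items:
        raise ValueError(f"selection field is empty or whitespace-only: {raw!r}")
    return items

def find_all_altloc_ids(atom_array, altloc_label: str = "altloc_id") -> set[str]:
    """Fetch the altloc annotation defensively: getattr(..., None) catches
    both a missing attribute and one that is present but None."""
    altloc_ids = getattr(atom_array, altloc_label, None)
    if altloc_ids is None:
        raise AttributeError(f"annotation {altloc_label!r} is missing or None")
    return set(np.unique(np.asarray(altloc_ids)).tolist()) - BLANK_ALTLOC_IDS
```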
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (8)
- scripts/eval/lddt_evaluation_script.py
- scripts/eval/rscc_grid_search_script.py
- src/sampleworks/core/rewards/real_space_density.py
- src/sampleworks/eval/eval_dataclasses.py
- src/sampleworks/eval/grid_search_eval_utils.py
- src/sampleworks/eval/structure_utils.py
- src/sampleworks/utils/atom_array_utils.py
- src/sampleworks/utils/density_utils.py
k-chrispens left a comment:
A few nits but won't hold up over them!
Quoted diff context:

```python
# this should be a temporary measure only until we switch to atomworks style in the RSCC script
```

```python
if any(x in selection for x in ("==", ">", "<", "<=", ">=", " in ")):
```
I think this is probably sufficient, but I am not sure. Maybe it would be better to detect the PyMOL-like selection syntax? We should also write (or use, if one exists) an atomworks/pandas <-> PyMOL style translator so that selections are easy to port directly into PyMOL scripts.
I'll make an issue to solve this. I have a translation method but it only handles some cases. We definitely need a more complete solution.
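For illustration, a partial translator covering just two PyMOL-style patterns might look like the sketch below; the column names (res_id, atom_name) and the output expression dialect are assumptions, and unrecognized selections pass through unchanged:

```python
import re

def pymol_to_query(selection: str) -> str:
    """Translate a couple of common PyMOL-style selections into
    pandas-query-style expressions. Handles only 'resi N-M' and
    'name X'; anything else is returned as-is."""
    sel = selection.strip()
    m = re.fullmatch(r"resi (\d+)-(\d+)", sel)
    if m:
        lo, hi = m.groups()
        return f"(res_id >= {lo}) & (res_id <= {hi})"
    m = re.fullmatch(r"name (\w+)", sel)
    if m:
        return f"atom_name == '{m.group(1)}'"
    return selection
```

A complete solution would need to cover boolean connectives, chain/segment selectors, and wildcards, which is presumably the scope of the issue mentioned above.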
Quoted diff context:

```python
def process_exp_with_selection(
    exp: Experiment,
```
I'm curious: how should we name these, given that we will be interacting with the diffuse hub schema? Should we align how we describe experiments for ourselves both here and there?
It's a good question. Personally I think the way to do this is to push all the metadata into the CIF files, and then we can do away with this kind of parsing.
Oh, but you're asking whether we should rename the "Experiment" class here? Yes, it would make sense to rename it.
Quoted diff context:

```python
    ----------
    exp: Experiment, a description of the structure generation experiment
    protein_config: ProteinConfig, specifying the locations of reference structures and maps
    px_seln_refernce_atom_array: AtomArrayStack,
```
Typo here that would be worth not propagating, I think: px_seln_refernce_atom_array -> px_seln_reference_atom_array
Quoted diff context:

```python
        f"({n_valid}/{n_total} atoms remaining)"
    )

    # TODO can we use [..., valid_mask] here?
```
I don't see why not, though I do think that coords is only going to be 2 dimensional here by requirement of AtomArrayStack or AtomArray
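For reference, the ellipsis form does handle both cases, provided the mask indexes the atom axis rather than the trailing coordinate axis, i.e. `[..., valid_mask, :]` rather than `[..., valid_mask]`:

```python
import numpy as np

valid_mask = np.array([True, False, True, True])

coords_2d = np.zeros((4, 3))     # AtomArray-like: (n_atoms, 3)
coords_3d = np.zeros((2, 4, 3))  # AtomArrayStack-like: (n_models, n_atoms, 3)

# The mask selects along the atom axis; the trailing ":" keeps x/y/z intact.
assert coords_2d[..., valid_mask, :].shape == (3, 3)
assert coords_3d[..., valid_mask, :].shape == (2, 3, 3)
```

So the same indexing expression works whether coords comes from a single model or a stack.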
…tom selection capabilities for eval scripts)
Sorry, this is a pretty big one. All it should do is what's in the title though.
Summary by CodeRabbit

- New Features
- Configuration Changes
- Improvements
- Utilities